QQ: introduce a default delivery limit #11937
Conversation
(force-pushed from 0fde9d0 to ac83fa2)
If the `delivery_limit` of a quorum queue is not set by a queue argument or policy, it now defaults to 20.
(force-pushed from ac83fa2 to 4746aaf)
Made it more visible that a default is in place. Also added a publisher count, as it was easy to do.
(force-pushed from 4746aaf to 3a386f4)
QQ: introduce a default delivery limit (backport #11937)
I tracked this PR down after reading about the change in the blog post, and I'm rather concerned by its impact. As it says right on the documentation page, quorum queues are the queue type to choose when you care about data safety; however, with this change, quorum queues are no longer safe by default and can lose messages. This feels like a pit-of-failure scenario. I don't feel that recommending users set up a dead letter policy to avoid this message loss is sufficient to justify making this change. If that policy were also created by default for users, then maybe this would be okay. I really think this change should be reconsidered before releasing 4.0 RTM.
Hi Brandon, I suspected we'd get a bit of pushback on this. That said, I am confident that setting a default delivery limit is the correct choice: if you advise users on how to configure their quorum queues, they must use a delivery limit, with an optional dead letter configuration. Without a delivery limit, a single un-parsable message that causes the consuming service to terminate could threaten not just the availability of the queue itself but potentially also the system as a whole, due to excessive resource / disk use. https://www.rabbitmq.com/docs/next/quorum-queues#repeated-requeues This does not undermine the data safety property that says that when you get a transfer accepted outcome (publisher confirm), the message is safe on disk on multiple nodes. That property remains. It is a tradeoff for sure, but the correct one. Other messaging services have similar defaults. There are some things we can discuss, such as the default number and what counts as a delivery attempt. Currently all returns are counted, even explicit ones. This is something we'd like to improve on, such that only session/channel terminations and modified outcomes with…
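For illustration, a delivery limit can also be set per queue at declaration time via the `x-delivery-limit` queue argument. A minimal sketch using `rabbitmqadmin` (the queue name and limit value are illustrative; a running broker with the management plugin is assumed):

```shell
# Declare a quorum queue with an explicit delivery limit of 50
# ("orders" and the value 50 are illustrative, not from this PR)
rabbitmqadmin declare queue name=orders durable=true \
  arguments='{"x-queue-type": "quorum", "x-delivery-limit": 50}'
```

Note that a queue argument is fixed at declaration time, which is exactly why the policy-based approach discussed below matters for existing queues.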
@kjnilsson My understanding of @bording's comment is that there isn't necessarily a disagreement about the usefulness of such a limit. I think it is more about how it is being introduced, and the potential risk of that, especially because the limit is retroactively fitted into the broker's capabilities, while ASB had it from the beginning. I'm aware that even with ASB, dead lettering is not enabled by default and is a choice the creator of the entity has to make, but this seems to be a different scenario when you look at it from the perspective of migrating from one RabbitMQ version to another: enforcing a limit without a safe default behavior. I'm not an expert in RabbitMQ, so what I'm spitballing here might be incorrect. But I wonder if it would be better to first enforce having a dead letter policy, and only after that has been rolled out, enforce the maximum limit.
@danielmarbach sure, and that is why we shipped it (a breaking change) in our first major release in many years, along with a whole set of other breaking changes. Delaying this change would probably be more of a surprise if it came in a minor release, don't you think?
FWIW, this change is documented not only in a blog post but, more importantly, in the Breaking Changes section of the RabbitMQ 4.0 release notes.
It's not mentioned in the 3.13 version of the docs but is mentioned in the 4.0 version, accessible at https://rabbitmq.com/docs/next. Given that it is mentioned in the release notes, I don't know where else we can put it. I doubt the majority of users routinely experience more than 20 redeliveries. However, the goal of the feature is to serve as a guard rail, and 20 is about as good as 40 or 64.
For NServiceBus, we always want to ensure that we provide a system that is safe by default, and that includes ensuring that the default settings will not have a chance of message loss. This is one of the reasons we recommend quorum queues for production systems, and why we went through the effort of moving our custom delay infrastructure to quorum queues.

Currently, when the RabbitMQ transport for NServiceBus creates quorum queues, we do not set a delivery limit. This was intentional: NServiceBus has its own recoverability mechanisms, and we don't want any native features like dead lettering interfering and removing a message out from underneath us. With the current way our transport creates queues, if one of our customers upgrades to RabbitMQ 4.0, this change means they go from a system that ensures a message that cannot be processed ends up in the NServiceBus error queue to one where that message can be lost.

We will have to modify our transport in some way to prevent this, but that is going to be a major undertaking as well. Some options I can see:
One of the big problems here is that there is no way to interact with policies through the AMQP client. We can't create them or check for their existence in any way. That means the only way the transport can interact with these settings is by setting the values directly on the entities when they are declared.

Both of these approaches require new settings on the queues, so we now have a migration problem. Since our existing customers already have production systems without any of these settings, we can't just add parameters to the queue declarations. That means we will have to build migration tools that copy existing messages to temp queues, delete the existing queues and then redeclare them with the new parameter values, move the messages back from the temp queues, and then delete the temp queues. This is tricky to get right, and I really don't want to have to build this sort of tool if we can avoid it.

While this specific scenario is about NServiceBus customers, it's easy to imagine how a similar problem could occur for any RabbitMQ users who have not previously had a delivery limit set.
While we could definitely make this recommendation to our customers, we can't rely on them doing this. We have to have some way to programmatically ensure that things have been configured in a way that meets our safe-by-default criteria. Policies being completely opaque to the transport via the .NET client makes this not an option for us.
Azure Service Bus does have a delivery limit, but it also provides a dead letter queue for every queue you create, and that isn't optional. This ensures that any message that hits the delivery limit can be inspected and retried. In my original post, I said that recommending a default dead letter policy be manually set up is not sufficient, and I still think that is true. If a default delivery limit is going to lead to messages being deleted, then the default broker configuration should also ensure that these messages are delivered to a dead letter queue.
The breaking change document you linked to does mention this, but it doesn't say anything about the potential impact. It also doesn't suggest creating a default dead letter policy like this PR suggests. While the docs that are linked to do mention this, I would want to see it called out explicitly in the breaking changes document. It's too easy to miss otherwise.
@bording let's not get too carried away. Release notes are not meant to be documentation guides. They can and do link to documentation guides. No amount of warnings will help those whose apps get into a crash-requeue-crash-requeue loop and experience more than 20 deliveries. Somehow no one in this discussion considers the amount of time our team spends on cases where another super-duper killer app gets into a crash-requeue-crash-requeue loop and drives nodes out of disk space, or fails to acknowledge a message (for which we introduced a forced guard rail that some have made a big deal about, even in environments where the limit was bumped to 30 days!). It's RabbitMQ that gets the blame for such scenarios, every single time.
As of e7296c1, the 4.0 release notes not only feature a recommendation in the post-upgrade section but also mention the consequences (for the tiny percentage of systems where 20 redeliveries per message is a common condition), and link to the DLX, policies and QQ guides in both the Breaking Changes and Upgrading to 4.0 sections. Anything beyond that belongs in the documentation guides, not the release notes.
@michaelklishin, with respect, @bording is not getting carried away. He is simply pointing out a lot of danger that, at the time of his comment, lurked hidden within a scant 11 words in the release notes. e7296c1 is a good start. He is also not minimizing the original reasons behind the change or asking for anything radical like reverting the whole thing, but only asking that some mechanism be put in place to make the change safe by default, as opposed to safe-if-you're-lucky. If there were a way to redefine a queue to set an explicit redelivery limit without recreate-and-migrate, everything would be fine. If there were a way to define, or even just verify, that redelivery policies exist over the AMQP protocol, everything would be fine. But right now, the way I understand the conversation above, there is no way for a client that only has AMQP access to verify that a queue that was guaranteed not to lose messages in v3 is still safe in v4.
@DavidBoike there is such a mechanism, and it has been around for as long as the redelivery limit in quorum queues (since 2018 or so?): policies, which can set `delivery-limit` for the queues they match.
A policy update is a runtime operation (it does not require a restart). That specific doc section links to another one on policies, as now do the release notes, and very soon other doc sections will be extended to make it dead obvious what the options are. The only problem with this? Some users never read any documentation or any release notes. We have been there before with the delivery acknowledgement timeout mentioned above.
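Concretely, such a policy update might look like this sketch (the policy name, pattern and limit value are illustrative; `--apply-to quorum_queues` assumes a broker version that supports that target):

```shell
# Override the default delivery limit for all quorum queues in the
# default vhost at runtime, no restart required
# (name "qq-delivery-limit" and the value 50 are illustrative)
rabbitmqctl set_policy qq-delivery-limit "^" '{"delivery-limit": 50}' \
  --apply-to quorum_queues
```

The same definition can be applied through the management UI or the HTTP API for operators without CLI access.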
(cherry picked from commit c2bb67a)
c2bb67a further modifies the release notes to mention that the limit can be increased. rabbitmq/rabbitmq-website@23de344, which will be live at https://www.rabbitmq.com/docs/next/quorum-queues#poison-message-handling in some five minutes, provides examples of how to define a policy that overrides the limit and another example that both overrides the limit and sets up dead lettering to an exchange. |
@michaelklishin thanks for pointing out the CLI, HTTP, PowerShell and management UI approaches.
Has the team considered supporting those operations in the client SDKs by introducing some form of "admin / management client", similar to what Azure Service Bus provides?
We have total empathy for that. As a platform company, we also have a fair number of discussions where emotions are involved. Our intent here was to chime in to better understand the reasons behind the change, show the problems we are exposed to, and find, together with you, good approaches for everyone involved.
@danielmarbach there is an AMQP 1.0 "management client" but in its current version of #10559, it does not manage policies. |
If NSB handles this itself, surely there should be no risk of ever reaching the limit? I presume there is already some means of detecting recurring errors and isolating poison messages?
After discussion we've decided to introduce a way to disable the delivery limit to restore 3.13.x behaviour. It will still have to be explicitly configured and the default will remain. |
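Assuming the opt-out ends up being expressed as a `delivery-limit` of -1 in a policy (an assumption based on this comment, not confirmed syntax in this thread; the policy name and pattern are illustrative), restoring the 3.13.x behaviour might be sketched as:

```shell
# Explicitly disable the delivery limit for all quorum queues,
# restoring pre-4.0 behaviour (-1 meaning "no limit" is assumed here)
rabbitmqctl set_policy qq-no-delivery-limit "^" '{"delivery-limit": -1}' \
  --apply-to quorum_queues
```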
If the `delivery_limit` of a quorum queue is not set by a queue argument or policy, it now defaults to 20.
Delivery limits are recommended for safe and smooth quorum queue operation. The downside is that, with poorly written applications or RabbitMQ configurations that frequently return messages to the queue, messages can now be dropped when they reach the delivery limit. This is a breaking change for some applications.
Hence, it will be recommended to put a blanket, low-priority policy in place that sets a dead letter configuration for all quorum queues, so that dropped messages are dead lettered to, e.g., a single stream, at least until applications and configurations can be properly updated.
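Such a blanket policy might be sketched as follows (the exchange name, policy name and priority are illustrative; the dead letter exchange and the stream bound to it must be declared separately):

```shell
# Low-priority blanket policy: route messages dropped by the delivery
# limit from all quorum queues to a pre-declared "dead-letters" exchange
# (all names here are illustrative, not from this PR)
rabbitmqctl set_policy qq-blanket-dlx "^" \
  '{"dead-letter-exchange": "dead-letters"}' \
  --priority 0 --apply-to quorum_queues
```

Using a low priority means any existing, more specific policy on a queue still wins, so this acts purely as a safety net.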
This PR also exposes the delivery limit on the queue page (and adds a publisher count because it was easy).